Case Study 4: Bankruptcy Detection

Team: Ana Glaser, Jake Harrison, Rob Burigo, and Yvan Sojdehei

Business Understanding

To minimize losses for the bank, we need to predict the likelihood that a company will go bankrupt in the future so that the bank can divest its investments in that company before the bankruptcy occurs.

Modeling Preparations

Methods:

The team is using both Random Forest and XGBoost. XGBoost minimizes error sequentially, and given our unbalanced target class we expected it to perform well relative to Random Forest. In preparation for modeling, we loaded all of the .arff files into one dataframe and decoded the class attribute, which was stored in byte format. The data was cleansed and scaled prior to modeling.

Evaluation Metrics:

The use case requires that accuracy be reported as a metric; however, due to the unbalanced nature of the target class, we will also measure precision, recall, and weighted F1 as truer measures of the model's ability to predict bankruptcy.

Importing Packages

In [1]:
import os
import pandas as pd
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from scipy.io import arff
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier 
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from yellowbrick.model_selection import FeatureImportances
import seaborn as sns
from sklearn.metrics import (confusion_matrix, recall_score, precision_score, f1_score,
                             classification_report, roc_auc_score, accuracy_score,
                             average_precision_score, precision_recall_curve,
                             plot_precision_recall_curve)
import warnings
warnings.filterwarnings('ignore')

Data Evaluation / Engineering

Data files were loaded and consolidated. Every column contained null values; these were imputed by replacing all nulls with zero.

Importing data: all .arff files were consolidated into a single dataframe.

In [2]:
data_file = os.listdir("./data/")
data_file
Out[2]:
['1year.arff', '2year.arff', '3year.arff', '4year.arff', '5year.arff']
In [3]:
my_data = pd.DataFrame()
for i in data_file:
    path = "./data/" + i
    print(path)
    data = arff.loadarff(path)         # load one .arff file
    data_df = pd.DataFrame(data[0])    # convert its records to a dataframe
    my_data = my_data.append(data_df)  # append to the consolidated dataframe

my_data['class'] = my_data['class'].str.decode('utf-8')  # decode the byte-encoded target
./data/1year.arff
./data/2year.arff
./data/3year.arff
./data/4year.arff
./data/5year.arff
In [4]:
my_data = my_data.rename(columns={'Attr1': 'net profit / total assets',
                                  'Attr2': 'total liabilities / total assets',
                                  'Attr3': 'working capital / total assets',
                                  'Attr4': 'current assets / short-term liabilities',
                                  'Attr5': '[(cash + short-term securities + receivables - short-term liabilities) / (operating expenses - depreciation)] * 365',
                                  'Attr6': 'retained earnings / total assets',
                                  'Attr7': 'EBIT / total assets',
                                  'Attr8': 'book value of equity / total liabilities',
                                  'Attr9': 'sales / total assets',
                                  'Attr10': 'equity / total assets',
                                  'Attr11': '(gross profit + extraordinary items + financial expenses) / total assets',
                                  'Attr12': 'gross profit / short-term liabilities',
                                  'Attr13': '(gross profit + depreciation) / sales',
                                  'Attr14': '(gross profit + interest) / total assets',
                                  'Attr15': '(total liabilities * 365) / (gross profit + depreciation)',
                                  'Attr16': '(gross profit + depreciation) / total liabilities',
                                  'Attr17': 'total assets / total liabilities',
                                  'Attr18': 'gross profit / total assets',
                                  'Attr19': 'gross profit / sales',
                                  'Attr20': '(inventory * 365) / sales',
                                  'Attr21': 'sales (n) / sales (n-1)',
                                  'Attr22': 'profit on operating activities / total assets',
                                  'Attr23': 'net profit / sales',
                                  'Attr24': 'gross profit (in 3 years) / total assets',
                                  'Attr25': '(equity - share capital) / total assets',
                                  'Attr26': '(net profit + depreciation) / total liabilities',
                                  'Attr27': 'profit on operating activities / financial expenses',
                                  'Attr28': 'working capital / fixed assets',
                                  'Attr29': 'logarithm of total assets',
                                  'Attr30': '(total liabilities - cash) / sales',
                                  'Attr31': '(gross profit + interest) / sales',
                                  'Attr32': '(current liabilities * 365) / cost of products sold',
                                  'Attr33': 'operating expenses / short-term liabilities',
                                  'Attr34': 'operating expenses / total liabilities',
                                  'Attr35': 'profit on sales / total assets',
                                  'Attr36': 'total sales / total assets',
                                  'Attr37': '(current assets - inventories) / long-term liabilities',
                                  'Attr38': 'constant capital / total assets',
                                  'Attr39': 'profit on sales / sales',
                                  'Attr40': '(current assets - inventory - receivables) / short-term liabilities',
                                  'Attr41': 'total liabilities / ((profit on operating activities + depreciation) * (12/365))',
                                  'Attr42': 'profit on operating activities / sales',
                                  'Attr43': 'rotation receivables + inventory turnover in days',
                                  'Attr44': '(receivables * 365) / sales',
                                  'Attr45': 'net profit / inventory',
                                  'Attr46': '(current assets - inventory) / short-term liabilities',
                                  'Attr47': '(inventory * 365) / cost of products sold',
                                  'Attr48': 'EBITDA (profit on operating activities - depreciation) / total assets',
                                  'Attr49': 'EBITDA (profit on operating activities - depreciation) / sales',
                                  'Attr50': 'current assets / total liabilities',
                                  'Attr51': 'short-term liabilities / total assets',
                                  'Attr52': '(short-term liabilities * 365) / cost of products sold)',
                                  'Attr53': 'equity / fixed assets',
                                  'Attr54': 'constant capital / fixed assets',
                                  'Attr55': 'working capital',
                                  'Attr56': '(sales - cost of products sold) / sales',
                                  'Attr57': '(current assets - inventory - short-term liabilities) / (sales - gross profit - depreciation)',
                                  'Attr58': 'total costs /total sales',
                                  'Attr59': 'long-term liabilities / equity',
                                  'Attr60': 'sales / inventory',
                                  'Attr61': 'sales / receivables',
                                  'Attr62': '(short-term liabilities *365) / sales',
                                  'Attr63': 'sales / short-term liabilities',
                                  'Attr64': 'sales / fixed assets'})
In [5]:
my_data.head()
Out[5]:
net profit / total assets total liabilities / total assets working capital / total assets current assets / short-term liabilities [(cash + short-term securities + receivables - short-term liabilities) / (operating expenses - depreciation)] * 365 retained earnings / total assets EBIT / total assets book value of equity / total liabilities sales / total assets equity / total assets ... (sales - cost of products sold) / sales (current assets - inventory - short-term liabilities) / (sales - gross profit - depreciation) total costs /total sales long-term liabilities / equity sales / inventory sales / receivables (short-term liabilities *365) / sales sales / short-term liabilities sales / fixed assets class
0 0.200550 0.37951 0.39641 2.0472 32.3510 0.38825 0.249760 1.33050 1.1389 0.50494 ... 0.121960 0.39718 0.87804 0.001924 8.4160 5.1372 82.658 4.4158 7.4277 0
1 0.209120 0.49988 0.47225 1.9447 14.7860 0.00000 0.258340 0.99601 1.6996 0.49788 ... 0.121300 0.42002 0.85300 0.000000 4.1486 3.2732 107.350 3.4000 60.9870 0
2 0.248660 0.69592 0.26713 1.5548 -1.1523 0.00000 0.309060 0.43695 1.3090 0.30408 ... 0.241140 0.81774 0.76599 0.694840 4.9909 3.9510 134.270 2.7185 5.2078 0
3 0.081483 0.30734 0.45879 2.4928 51.9520 0.14988 0.092704 1.86610 1.0571 0.57353 ... 0.054015 0.14207 0.94598 0.000000 4.5746 3.6147 86.435 4.2228 5.5497 0
4 0.187320 0.61323 0.22960 1.4063 -7.3128 0.18732 0.187320 0.63070 1.1559 0.38677 ... 0.134850 0.48431 0.86515 0.124440 6.3985 4.3158 127.210 2.8692 7.8980 0

5 rows × 65 columns

Displaying the summary statistics of our dataset. Of note, the attributes take both positive and negative values; as this is financial ratio data, we need to be careful not to interpret these values as absolute magnitudes.

In [6]:
my_data.describe()
Out[6]:
net profit / total assets total liabilities / total assets working capital / total assets current assets / short-term liabilities [(cash + short-term securities + receivables - short-term liabilities) / (operating expenses - depreciation)] * 365 retained earnings / total assets EBIT / total assets book value of equity / total liabilities sales / total assets equity / total assets ... working capital (sales - cost of products sold) / sales (current assets - inventory - short-term liabilities) / (sales - gross profit - depreciation) total costs /total sales long-term liabilities / equity sales / inventory sales / receivables (short-term liabilities *365) / sales sales / short-term liabilities sales / fixed assets
count 43397.000000 43397.000000 43397.000000 43271.000000 4.331600e+04 43397.000000 43397.000000 43311.000000 43396.000000 43397.000000 ... 4.340400e+04 4.327800e+04 43398.000000 4.332100e+04 43398.000000 4.125300e+04 43303.000000 4.327800e+04 43271.000000 42593.000000
mean 0.035160 0.590212 0.114431 6.314702 -3.853466e+02 -0.056107 0.093478 12.640779 2.652166 0.626868 ... 7.672188e+03 -2.621959e+01 -0.010510 3.002644e+01 1.333288 4.480858e+02 17.033202 1.502328e+03 9.343074 72.788592
std 2.994109 5.842748 5.439429 295.434425 6.124303e+04 7.201326 5.713075 505.894281 62.932732 14.670597 ... 7.005310e+04 5.327862e+03 13.674072 5.334454e+03 122.104445 3.234560e+04 553.049406 1.392667e+05 124.177354 2369.339482
min -463.890000 -430.870000 -479.960000 -0.403110 -1.190300e+07 -508.410000 -517.480000 -141.410000 -3.496000 -479.910000 ... -1.805200e+06 -1.108300e+06 -1667.300000 -1.986900e+02 -327.970000 -1.244000e+01 -12.656000 -2.336500e+06 -1.543200 -10677.000000
25% 0.003429 0.268980 0.021521 1.049500 -4.908000e+01 0.000000 0.005776 0.430275 1.018500 0.295470 ... 2.755425e+01 9.348500e-03 0.014649 8.753200e-01 0.000000 5.545500e+00 4.510150 4.214400e+01 3.097650 2.176800
50% 0.049660 0.471900 0.196610 1.569800 -1.034500e+00 0.000000 0.059634 1.070400 1.195350 0.505970 ... 1.088350e+03 5.294300e-02 0.119670 9.509600e-01 0.006366 9.791700e+00 6.636300 7.132600e+01 5.087600 4.282500
75% 0.129580 0.688320 0.403390 2.787450 5.063425e+01 0.089446 0.150880 2.615700 2.062500 0.709100 ... 4.993325e+03 1.290975e-01 0.284605 9.926400e-01 0.236052 2.018100e+01 10.394500 1.172200e+02 8.598850 9.776200
max 94.280000 480.960000 28.336000 53433.000000 1.250100e+06 543.250000 649.230000 53432.000000 9742.300000 1099.500000 ... 6.123700e+06 2.931500e+02 552.640000 1.108300e+06 23853.000000 4.818700e+06 108000.000000 2.501600e+07 23454.000000 294770.000000

8 rows × 64 columns

Identification and imputation of Null values in the data

In [7]:
colms = np.where(my_data.isnull().sum() > 0)
print(colms)
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], dtype=int64),)
In [8]:
my_data = my_data.fillna(0)
In [9]:
colms = np.where(my_data.isnull().sum() > 0)
print(colms)
(array([], dtype=int64),)

To understand the characteristics of each class, we assessed mean attribute values for bankrupt vs. non-bankrupt companies. There appears to be a clear distinction in some of the attribute values between our target classes. While building the model, we will explore which features are most influential in predicting bankrupt companies.

In [10]:
my_data.groupby(['class']).mean()
Out[10]:
net profit / total assets total liabilities / total assets working capital / total assets current assets / short-term liabilities [(cash + short-term securities + receivables - short-term liabilities) / (operating expenses - depreciation)] * 365 retained earnings / total assets EBIT / total assets book value of equity / total liabilities sales / total assets equity / total assets ... working capital (sales - cost of products sold) / sales (current assets - inventory - short-term liabilities) / (sales - gross profit - depreciation) total costs /total sales long-term liabilities / equity sales / inventory sales / receivables (short-term liabilities *365) / sales sales / short-term liabilities sales / fixed assets
class
0 0.053098 0.543811 0.157384 6.404580 -366.252015 0.000040 0.113920 12.918564 2.688207 0.676952 ... 8020.298733 -27.279235 0.048536 31.429929 1.369922 441.145427 17.080733 1501.643189 9.337911 69.770166
1 -0.319401 1.504741 -0.734674 4.134217 -746.216002 -1.165242 -0.310769 6.584036 1.928654 -0.365093 ... 790.534411 -3.690713 -1.177105 1.090152 0.605021 124.056554 15.263197 1424.608922 8.846351 104.160628

2 rows × 64 columns

We explore the initial data further by isolating some features we felt would be important to the model and visualizing them with boxplots. Notice there is clear separation between class 0 and class 1 for all three features. For profit on operating activities / financial expenses, most non-bankrupt companies fall between 0 and 5, whereas the bankrupt companies are concentrated between 0 and 1. The other two features show clear positive/negative separation.

In [11]:
sns.boxplot(x='class', y='profit on operating activities / financial expenses', data=my_data).set(title='Boxplot for profit on operating activities / financial expenses')
Out[11]:
[Text(0.5, 1.0, 'Boxplot for profit on operating activities / financial expenses')]
In [12]:
sns.boxplot(x='class', y='(gross profit + depreciation) / total liabilities', data=my_data).set(title='Boxplot for (gross profit + depreciation) / total liabilities')
Out[12]:
[Text(0.5, 1.0, 'Boxplot for (gross profit + depreciation) / total liabilities')]
In [13]:
sns.boxplot(x='class', y='gross profit (in 3 years) / total assets', data=my_data).set(title='Boxplot for gross profit (in 3 years) / total assets')
Out[13]:
[Text(0.5, 1.0, 'Boxplot for gross profit (in 3 years) / total assets')]

Upon investigating the data, we noticed that many of the variables share similar components. Features that share components are likely to be highly correlated, which could be detrimental to the model. With that in mind, we took 10 of our most important features and plotted their correlations with each other. We see correlation between (net profit + depreciation) / total liabilities and (gross profit + depreciation) / total liabilities, which makes sense. Since this was the only such pair, we did not find it necessary to remove any highly correlated features from the model.

In [14]:
my_data_top = my_data[['profit on operating activities / financial expenses', '(gross profit + depreciation) / total liabilities',
                      'gross profit (in 3 years) / total assets','profit on sales / sales','(current assets - inventory) / short-term liabilities',
                      '(net profit + depreciation) / total liabilities','(gross profit + depreciation) / sales','retained earnings / total assets',
                      'total liabilities / ((profit on operating activities + depreciation) * (12/365))','profit on operating activities / sales']]
In [15]:
data_corr = my_data_top.corr()  # grabs correlation variables of features
mask = np.zeros_like(data_corr, dtype=bool)  # returns array of zeros w/ same shape and type of given array
mask[np.triu_indices_from(mask)]= True  # Generate a mask for the upper triangle

f, ax = plt.subplots(figsize=(11, 9))  # Matplotlib figure setup / formats nicely
sns.heatmap(data_corr,
            mask = mask,
            square = True,
            cmap = 'coolwarm', # Easier visualization of correlated variables
            annot = True,
            annot_kws = {'size': 12})
ax.set(title='Heatmap of Features with Significant Importance to our Models')
Out[15]:
[Text(0.5, 1.0, 'Heatmap of Features with Significant Importance to our Models')]

Modeling Preparation

We will build both a Random Forest model and an XGBoost model. The data will be down-sampled due to the imbalance in the target class. Our assumption is that correctly identifying potentially bankrupt companies matters more than overall accuracy across both classes, so we want a model that does not have to contend with the imbalance.

We will report model performance with the accuracy metric and evaluate the effectiveness of the models using precision and recall, combined with the weighted F1 score, which balances the two.
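
As a quick illustration of why accuracy alone is insufficient here, the toy example below uses hypothetical labels (not our data) to show how a classifier can score high accuracy on an imbalanced label set while missing much of the minority class:

In [ ]:
# Illustrative only: hypothetical labels, not our data.
# Accuracy looks strong while recall on the minority (bankrupt) class is poor,
# which is why we also track precision, recall, and weighted F1.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true_toy = [0] * 18 + [1] * 2   # 18 healthy companies, 2 bankrupt
y_pred_toy = [0] * 19 + [1] * 1   # misses one of the two bankrupt companies

print("accuracy:   ", accuracy_score(y_true_toy, y_pred_toy))    # 0.95
print("precision:  ", precision_score(y_true_toy, y_pred_toy))   # 1.00
print("recall:     ", recall_score(y_true_toy, y_pred_toy))      # 0.50
print("weighted F1:", f1_score(y_true_toy, y_pred_toy, average='weighted'))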

Denoting the X and y

In [16]:
X = my_data.drop("class" , axis=1)
y = my_data['class']

Due to the imbalance in the target class, we have elected to use a down-sampling method in order to increase precision and recall while taking only a small hit to accuracy.

In [17]:
under_sample = RandomUnderSampler(sampling_strategy='all')
X_under, y_under = under_sample.fit_resample(X,y)

Our sampling methodology is to perform a 70/30 train/test split on the merged dataframe that contains all of the yearly datasets.

In [18]:
X_train_under, X_test_under, y_train_under, y_test_under = train_test_split(X_under, y_under, test_size=.30, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)

We scaled the data using a standard scaler fit on the training dataset. The under-sampled data will be used for both the XGBoost and Random Forest models.

In [19]:
# Fit one scaler per split so the under-sampled pipeline does not reuse (and refit)
# the scaler that was fit on the full training data.
scaler = StandardScaler()
scaler_u = StandardScaler()

# Full (imbalanced) split: fit on the training set only, then transform train and test
X_train_scaler = scaler.fit(X_train)
X_train_scaled = pd.DataFrame(data=X_train_scaler.transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(data=X_train_scaler.transform(X_test), columns=X_test.columns)

# Under-sampled split: fit on its own training set, then transform its train and test
X_train_scaler_u = scaler_u.fit(X_train_under)
X_train_scaled_u = pd.DataFrame(data=X_train_scaler_u.transform(X_train_under), columns=X_train_under.columns)
X_test_scaled_u = pd.DataFrame(data=X_train_scaler_u.transform(X_test_under), columns=X_test_under.columns)

Model Building & Evaluation

Random Forest

Random Forest is a tree-based ensemble method which, using a technique called bagging, combines the predictions of many decision trees into a single result. It helps prevent overfitting by training each tree on a random bootstrap sample of the data and a random subset of the features.

Building a random forest model with 100 estimators and the 'entropy' criterion, based on the results of a grid search (not shown here; a sketch follows).
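
For reference, a minimal sketch of the kind of grid search we ran to select these hyperparameters; the parameter grid shown is illustrative, and the original search output is not reproduced here.

In [ ]:
# Sketch of the hyperparameter search (illustrative grid; original search not shown).
rf_grid = {'n_estimators': [50, 100, 200],
           'criterion': ['gini', 'entropy']}
rf_search = GridSearchCV(RandomForestClassifier(random_state=1),
                         rf_grid,
                         scoring='f1_weighted',
                         cv=5)
rf_search.fit(X_train_scaled_u, y_train_under)
print(rf_search.best_params_)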

In [29]:
dTree = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state=1)
dTree = dTree.fit(X_train_scaled_u, y_train_under)
y_pred = dTree.predict(X_test_under)
accuracy_score(y_pred,y_test_under)
Out[29]:
0.6717131474103586

The overall model accuracy is not extremely high; however, the supporting performance metrics of weighted F1 and class-level precision and recall emphasize the minority class, which is the focus of our analysis, and tell a more useful story.

The confusion matrix displays the distribution of classifications for the Random Forest model. With a recall score of 0.59 for the bankrupt class, around 41% of the bankrupt companies were misclassified as not bankrupt, while 59% were correctly predicted as bankrupt.

In [30]:
cnf_matrix = confusion_matrix(y_test_under, y_pred)
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[30]:
Text(0.5, 352.48, 'Predicted label')

The confusion matrix and report below show reasonable precision and recall for the bankrupt class, compared with the results of a random forest trained on the original imbalanced data without down-sampling (a sketch of that baseline follows the report).

In [31]:
print(classification_report(y_test_under, y_pred))
              precision    recall  f1-score   support

           0       0.66      0.75      0.70       641
           1       0.69      0.59      0.64       614

    accuracy                           0.67      1255
   macro avg       0.67      0.67      0.67      1255
weighted avg       0.67      0.67      0.67      1255
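
For context, the sketch below trains the same random forest on the original imbalanced split (without down-sampling) referenced above; its exact scores are not reproduced here.

In [ ]:
# Baseline sketch: the same random forest trained on the original imbalanced split
# (scores not reproduced here; shown only to make the comparison above concrete).
rf_imbalanced = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=1)
rf_imbalanced.fit(X_train_scaled, y_train)
print(classification_report(y_test, rf_imbalanced.predict(X_test_scaled)))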

XGBoost

XGBoost is also a tree-based ensemble method, but rather than using bagging like Random Forest, it boosts: weak models are built sequentially, with each new model fit to the errors of the ones before it in order to minimize overall error. As noted above, Random Forest mainly guards against overfitting, whereas boosting works to reduce underfitting and bias. And while the random forest trees can be built in parallel, the boosting algorithm operates in sequence, making it more computationally expensive.

Building the boosting model (implemented with scikit-learn's GradientBoostingClassifier) with 35 estimators and the 'mae' criterion, as determined by a grid search (not shown here; a sketch follows).
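
As with the random forest, a minimal sketch of the grid search that produced these settings; the grid shown is illustrative, and the original search output is not reproduced here.

In [ ]:
# Sketch of the boosting hyperparameter search (illustrative grid; original search not shown).
gb_grid = {'n_estimators': [25, 35, 50],
           'criterion': ['friedman_mse', 'mae']}
gb_search = GridSearchCV(GradientBoostingClassifier(random_state=1999),
                         gb_grid,
                         scoring='f1_weighted',
                         cv=5)
gb_search.fit(X_train_scaled_u, y_train_under)
print(gb_search.best_params_)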

In [45]:
abcl = GradientBoostingClassifier(n_estimators=35,criterion = 'mae', random_state=1999)
abcl = abcl.fit(X_train_scaled_u, y_train_under)
y_pred_xgb = abcl.predict(X_test_under)
accuracy_score(y_pred_xgb, y_test_under)
Out[45]:
0.6231075697211156

This model's accuracy is about 5 percentage points lower than the random forest model's.

The confusion matrix displays the distribution of classifications for the XGBoost model. With a recall score of 0.36 for the bankrupt class, around 64% of the bankrupt companies were misclassified as not bankrupt, while only 36% were correctly predicted as bankrupt. This is a significant decrease in performance from the random forest model.

In [46]:
cnf_matrix = confusion_matrix(y_test_under, y_pred_xgb)
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[46]:
Text(0.5, 352.48, 'Predicted label')
In [47]:
print(classification_report(y_test_under, y_pred_xgb))
              precision    recall  f1-score   support

           0       0.59      0.87      0.70       641
           1       0.73      0.36      0.48       614

    accuracy                           0.62      1255
   macro avg       0.66      0.62      0.59      1255
weighted avg       0.66      0.62      0.60      1255

Below are the top 10 most important features for our Random Forest model. Several of the features have total assets or total liabilities as a denominator. This makes sense at a macro view: since we are trying to identify companies at risk of going bankrupt, it is logical that assets and liabilities are key figures in that determination.

Investors are willing to overlook raw sales / cash numbers as long as the asset / liability ratio is in good health.

In [51]:
viz = FeatureImportances(dTree, relative=False,topn=10)
viz.fit(X_train_under, y_train_under)
Out[51]:
FeatureImportances(ax=<AxesSubplot:>,
                   estimator=RandomForestClassifier(criterion='entropy',
                                                    random_state=1),
                   relative=False, topn=10)

Case Conclusion

After exploring the original, SMOTE (over-sampled), and down-sampled data with both Random Forest and XGBoost models, the team concluded that the down-sampled approach with a random forest algorithm did the best job of predicting which companies will go bankrupt.
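
For completeness, a minimal sketch of the SMOTE (over-sampled) variant referenced above, applied to the original training split; its scores from our comparison are not reproduced here.

In [ ]:
# Sketch of the SMOTE comparison mentioned above (illustrative; results not reproduced here).
smote = SMOTE(random_state=1)
X_smote, y_smote = smote.fit_resample(X_train, y_train)   # over-sample the minority class
rf_smote = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=1)
rf_smote.fit(X_smote, y_smote)
print(classification_report(y_test, rf_smote.predict(X_test)))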

There is a trade-off in model performance: optimizing bankruptcy identification comes at the cost of falsely flagging healthy companies as potential bankruptcies. A business decision should be made to determine the level of false positives the bank is willing to tolerate in order to minimize bankruptcy exposure risk.

Given the complexity of separating the classes, more sophisticated classification models should be evaluated to improve the ability to predict future bankruptcies. Further research should also leverage the leading indicators the model highlighted as significant in order to better protect investment returns.
